Failure Prediction and Scalable Checkpointing for Reliable Large-Scale Grid Computing

Authors

  • Brent Rood
  • John Paul Walters
  • Vipin Chaudhary
  • Michael J. Lewis
Abstract

Computational clusters, the grids that federate them, and the applications that use their significant computing potential all continue to grow with advances in hardware technology, cluster management, and grid middleware. As they do, the likelihood that large-scale, long-running grid and cluster applications will have to deal with underlying node unavailability and cluster failure increases as well. The primary weapons against this problem (checkpointing, migration, replication, and effective scheduling) do not currently scale well enough to be effective for the largest, most important grid and cluster applications. Complementary research efforts in upstate New York are beginning to address this issue at a variety of levels, including: (i) low-level mechanisms that predict individual processor failures by observing and reacting to low-level indicators in their chip state; (ii) scalable cluster-level checkpointing solutions that do not require centralized storage for replicated checkpoints; and (iii) grid-level efforts to distinguish among node unavailability states, to characterize the behavior of nodes, to predict their near-future unavailability, and to make better grid scheduling decisions based on this information and on the characteristics and capabilities of applications.

Similar resources

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...

E-science Workflow on the Grid

Grid computing, which can be characterized as large-scale distributed resource sharing and cooperation, has quickly become a mainstream technology in distributed computing. In this paper, we present the idea of applying certain grid workflow management techniques to mediate various services for grid-based e-science processes. The techniques of adaptable workflow services, aggressive sub-workflo...

GCIMCA: A Globus and SPRNG Implementation of a Grid-Computing Infrastructure for Monte Carlo Applications

The implementation of large-scale Monte Carlo computation on the grid benefits from state-of-the-art approaches to accessing a computational grid and requires scalable parallel random number generators with good quality. The Globus software toolkit facilitates the creation and utilization of a computational grid for large distributed computational jobs. The Scalable Parallel Random Number Gener...

Checkpoint Based Recovery Aware Component System in Grid Computing

Grids are distributed systems that dynamically coordinate a large number of heterogeneous resources to execute large-scale projects involving collaborating teams of scientists, high-performance computers, massive data stores, high-bandwidth networking, and/or scientific instruments like telescopes and synchrotrons. Failure in grids is arguably inevitable due to the massive scale and the hetero...

Provenance Based Checkpointing Method for Dynamic Health Care Smart System

Smart systems in telemedicine frequently use intelligent sensor devices at large scale. Practitioners can continuously monitor the vital parameters of hundreds of patients in real time. The most important pillars of remote patient monitoring services are communication and data processing. Large-scale data processing is done mainly using workflows. Some workflows run in real time, more compl...

Publication date: 2007